Data Gathering

To see all raw data gathered click here

Quarterly and Annual Ridership Totals by Mode of Transportation

The first dataset comes from the American Public Transportation Association (APTA) and serves as an introductory overview of public transit ridership over time. It gives a broad view of quarterly ridership across the entire country from 1990 onward, and it was chosen to set the stage for the problem this project explores.

The raw data and methodology for how it was obtained can be found using this link: https://www.apta.com/research-technical-resources/transit-statistics/ridership-report/

The data itself can be downloaded using this link: https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx

To download this data, I used the httr package in R to request the file and save it locally in Excel format. Below is the code for this step, followed by a screenshot of the raw data to illustrate its form upon download:

Code
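library(readxl)
library(httr)
# Direct link to the APTA ridership workbook
url1 <- 'https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx'
# Download the workbook into ../data under a unique temporary file name
tf <- tempfile(pattern = "APTA-Ridership-by-Mode-and-Quarter-1990-Present", fileext = ".xlsx", tmpdir = "../data")
resp <- GET(url1, write_disk(tf))
stop_for_status(resp)  # halt if the download failed
# Read the second sheet of the workbook and inspect its structure
df <- read_excel(tf, sheet = 2L)
str(df)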

Quarterly and Annual Ridership Totals by Mode of Transportation

News API Data

An essential part of understanding public perception of a topic is assessing how it is covered in the news. Coverage often informs general opinion and can introduce conversations that were not previously in the zeitgeist. Thus, this project analyzes text data from https://newsapi.org/ to study news coverage of two distinct public transit systems.

For this project, I will be looking at data on the Washington Metropolitan Area Transit Authority (WMATA) and Bay Area Rapid Transit (BART). Both systems have several advantages for academic study: they are large networks with rich histories and close connections to their respective cities, robust data sources exist that allow analysis from several angles, and comparing them offers perspective on differences between cities on opposite coasts.

The following shows how I accessed the News API using Python. Each request writes its response to a JSON file as raw data, the start of which is included below each code block to show the nature of the data prior to cleaning.

Code
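import os
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# News API "everything" endpoint
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

# Read the key from the environment instead of hard-coding it
# (NEWSAPI_KEY is an assumed variable name; set it before running)
API_KEY=os.environ['NEWSAPI_KEY']
TOPIC='wmata'

# Query parameters; the leading '+' marks the topic as a required term
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}

# Send the request, stop on an HTTP error, and parse the JSON body
response = requests.get(baseURL, params=URLpost)
response.raise_for_status()
response = response.json()

# Save the raw response for later cleaning
with open('WMATA-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)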

Raw JSON data from News API; Topic: WMATA
Code
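import os

# Same request as above, with the topic changed to BART
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

# Read the key from the environment instead of hard-coding it
# (NEWSAPI_KEY is an assumed variable name; set it before running)
API_KEY=os.environ['NEWSAPI_KEY']
TOPIC='Bay Area Rapid Transit'

# Note: in News API's search syntax '+' marks a required term; matching the
# exact multi-word phrase would require wrapping the topic in double quotes
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}

# Send the request, stop on an HTTP error, and parse the JSON body
response = requests.get(baseURL, params=URLpost)
response.raise_for_status()
response = response.json()

# Save the raw response for later cleaning
with open('BART-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)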

Raw JSON data from News API; Topic: Bay Area Rapid Transit
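Because the raw responses are nested JSON, a first cleaning step is typically to flatten each article into one flat record. The following is a minimal sketch of that idea, not the cleaning code used in this project. It assumes the response follows the News API layout of a top-level "articles" list whose items carry "source", "title", "description", "url", and "publishedAt" fields; the helper name articles_to_rows and the sample response are hypothetical and written by hand for illustration.

```python
def articles_to_rows(response):
    """Flatten a News API style response dict into a list of flat records."""
    rows = []
    for article in response.get("articles", []):
        # "source" is itself a small dict; keep only its display name
        source = article.get("source") or {}
        rows.append({
            "source": source.get("name"),
            "title": article.get("title"),
            "description": article.get("description"),
            "publishedAt": article.get("publishedAt"),
            "url": article.get("url"),
        })
    return rows


# Hypothetical, hand-written response in the assumed News API shape
sample_response = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {
            "source": {"id": None, "name": "Example News"},
            "author": None,
            "title": "Example headline about Metro service",
            "description": "Placeholder description.",
            "url": "https://example.com/article",
            "publishedAt": "2023-10-01T12:00:00Z",
            "content": "Placeholder content.",
        }
    ],
}

rows = articles_to_rows(sample_response)
print(rows[0]["source"], "|", rows[0]["title"])
```

In practice the same function could be applied to the dictionaries loaded from the two saved JSON files, and the resulting rows could then be passed to pandas for text cleaning.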

Ridership by Hour

In addition to the volume of public transit usage, we can glean information on the purpose of that usage by analyzing ridership by hour of the day. High peaks during “rush hour” likely indicate a strong influence of work commuting on the data. Because of this, I downloaded both pre-pandemic and post-pandemic data sets of WMATA ridership by hour to examine this relationship and whether it has changed under new circumstances. In this case, March 17, 2020 is chosen as the demarcation date, as that was the day on which the first social distancing precautions were announced in Washington, D.C. The data is shown below:

Hourly Ridership from 1/1/2018 to 3/17/2020

Hourly Ridership from 3/18/2020 to 10/5/2023
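The pre/post comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (date, hour of day, entries) and the function name mean_entries_by_hour are hypothetical, and the sample values are made up rather than taken from the WMATA files. Records dated on or before March 17, 2020 are treated as pre-pandemic, matching the date ranges in the captions above.

```python
from collections import defaultdict
from datetime import date

# Demarcation date from the text: last day of the pre-pandemic period
CUTOFF = date(2020, 3, 17)


def mean_entries_by_hour(records):
    """Return {period: {hour: mean entries}} for the 'pre' and 'post' periods."""
    grouped = {"pre": defaultdict(list), "post": defaultdict(list)}
    for day, hour, entries in records:
        period = "pre" if day <= CUTOFF else "post"
        grouped[period][hour].append(entries)
    return {
        period: {hour: sum(vals) / len(vals) for hour, vals in sorted(by_hour.items())}
        for period, by_hour in grouped.items()
    }


# Made-up records purely for illustration: (date, hour of day, entries)
sample = [
    (date(2019, 5, 6), 8, 1000),
    (date(2019, 5, 7), 8, 1200),
    (date(2019, 5, 6), 13, 300),
    (date(2021, 5, 6), 8, 400),
    (date(2021, 5, 6), 13, 200),
]

profile = mean_entries_by_hour(sample)
print(profile["pre"][8], profile["post"][8])
```

Comparing the two hourly profiles, for example the ratio of the morning peak to the midday level in each period, is one way to check whether the commuting pattern has flattened.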

Ridership by Demographic

In answering the question of whether public transit’s role as a public service should be the paramount consideration in judging its efficacy, it is important to understand that it often disproportionately serves underprivileged groups. By analyzing demographic data, we can gain insight into who benefits most from robust public transit systems. To address this, I use data from the U.S. Census Bureau that provides 2021 5-year estimates of means of transportation to work by selected characteristics. The raw data is shown below:

2021 5-Year Estimate of Transportation Means by Demographic